(57) Abstract: An interactive voice response system is described that supports full duplex data transfer to enable the playing of a
voice prompt to a user of a telephony system while the system listens for voice barge-in from the user. The system includes a speech
detection module that may utilize various criteria such as frame energy magnitude and duration thresholds to detect speech. The
system also includes an automatic speech recognition engine. When the automatic speech recognition engine recognizes a segment
of speech, a feature extraction module may be used to subtract a prompt echo spectrum, which corresponds to the currently playing
voice prompt, from an echo-dirtied speech spectrum recorded by the system. In order to improve spectrum subtraction, an estimation
of the time delay between the echo-dirtied speech and the prompt echo may also be performed.

VOICE BARGE-IN IN TELEPHONY SPEECH RECOGNITION

FIELD OF THE INVENTION 

The present invention relates to the field of speech recognition and, in 
particular, to voice barge-in for speech recognition based telephony applications.

BACKGROUND OF THE INVENTION 

Speech recognition based telephony systems are used by businesses to
answer phone calls with a system that engages users in natural language dialog.
These systems use interactive voice response (IVR) telephony applications for a
spoken language interface with a telephony system. IVR applications enable
users to interrupt the system output at any time, for example, if the output is
based on an erroneous understanding of a user's input or if it contains
superfluous information that a user does not want to hear. Barge-in allows a
user to interrupt a prompt being played using voice input. Enabling barge-in
may significantly enhance the user's experience by allowing the user to interrupt
the system prompt, whenever desired, in order to save time. Without barge-in, a
user may react only when the system prompt completes, otherwise the user's
input is ignored by the system. This may be very inconvenient to the user,
particularly when the prompt is long and the user already knows the prompt
message.

In today's touch tone based IVR systems, barge-in is widely adopted.
However, for speech recognition based IVR systems, barge-in poses a
much greater challenge due to background noise and echo from a prompt that
may be transmitted to a voice recognition system.

One method of barge-in, referred to as key barge-in, is to stop playing a
prompt and be ready to process a user's speech after the user presses a special
key, such as the "#" or "*" key. One problem with such a method is that the user
must be informed of how to use it. As such, another prompt may need to be
added to the system, thereby undesirably increasing the amount of user
interaction time with the system.

Another method of barge-in, referred to as voice barge-in, enables a user
to speak directly to the system to interrupt the prompt. Figure 1 illustrates how
barge-in occurs during prompt play in a voice barge-in system. Such a method
uses speech detection to detect a user's speech while the prompt is playing.
Once the user's speech is detected in the incoming data, the system stops playing
and immediately begins a record phase in which the incoming data is made
available to a speech recognition engine. The speech recognition engine
processes the user's speech.

Although such a method may provide a better solution than key barge-in,
the voice barge-in function of current IVR systems has several problems. One
problem with current IVR systems is that the computer-telephone cards used in
these systems may not support full-duplex data transfer. Another problem with
current IVR systems is that they may not be able to detect speech robustly from
background noise, non-speech sounds, irrelevant speech and/or prompt echo.
For example, the prompt echo that resides in these systems may significantly
degrade speech quality. Using traditional adaptive filtering methods to remove
near-end prompt echo may significantly degrade the performance of automatic
speech recognition engines used in these systems.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example, and not by way of
limitation, in the figures of the accompanying drawings and in which like
reference numerals refer to similar elements and in which:

Figure 1 illustrates barge-in during prompt play in a voice barge-in
system.

Figure 2 illustrates one embodiment of an interactive voice response
telephony system.

Figure 3 illustrates one embodiment of a method of implementing an
interactive voice response system.

Figure 4 illustrates one embodiment of speech detection in an input
signal.

Figure 5 illustrates one embodiment of a feature extraction method.

WO 02/052546 PCT/CN00/00733

Figure 6 illustrates an embodiment of a feature extraction method for a
particular feature.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth such 
as examples of specific systems, components, modules, etc. in order to provide a 
thorough understanding of the present invention. It will be apparent, however,
to one skilled in the art that these specific details need not be employed to 
practice the present invention. In other instances, well known components or 
methods have not been described in detail in order to avoid unnecessarily 
obscuring the present invention. 

The present invention includes various steps, which will be described 
below. The steps of the present invention may be performed by hardware
components or may be embodied in machine-executable instructions, which may
be used to cause a general-purpose or special-purpose processor programmed
with the instructions to perform the steps. Alternatively, the steps may be
performed by a combination of hardware and software. 

The present invention may be provided as a computer program product, 
or software, that may include a machine-readable medium having stored 
thereon instructions, which may be used to program a computer system (or 
other electronic devices) to perform a process according to the present invention. 
A machine readable medium includes any mechanism for storing or transmitting 
information in a form (e.g., software) readable by a machine (e.g., a computer).
The machine-readable medium may include, but is not limited to, magnetic
storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM);
magneto-optical storage medium; read only memory (ROM); random access
memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM);
flash memory; electrical, optical, acoustical or other form of propagated signal
(e.g., carrier waves, infrared signals, digital signals, etc.); or other type of
medium suitable for storing electronic instructions.

Figure 2 illustrates one embodiment of an interactive voice response
telephony system. IVR system 200 allows for a spoken language interface with
telephony system 290. IVR system 200 supports voice barge-in by enabling a
user to interrupt a prompt being played using voice input. In one embodiment,
IVR system 200 includes interface module 205 and voice processing module 225.
Interface module 205 provides interface circuitry for direct connection of voice
processing module 225 with line 203 carrying voice data. Line 203 may be an
analog or a digital line.

Interface module 205 includes voice input device 210 and voice output 
device 220. Voice input device 210 and voice output device 220 may be routed
together using bus 216 to support full-duplex data transfer. Voice input device 
210 provides for voice data transfer from telephony system 290 to voice 
processing module 225. Voice output device 220 provides for voice data transfer 
from voice processing module 225 to telephony system 290. For example, voice 
output device 220 may be used to play a voice prompt to a user of telephony 
system 290 while voice input device 210 is used to listen for barge-in (e.g., voice 
or key) from a user.

In one embodiment, for example, voice devices 210 and 220 may be
Dialogic D41E cards, available from Dialogic Corporation of Parsippany, NJ.
Dialogic's SCbus routing function may be used to establish communications
between the Dialogic D41E cards. In an alternative embodiment, voice devices
from other manufacturers may be used, for example, cards available from
Natural Microsystems of Framingham, MA.

In one embodiment, voice processing module 225 may be implemented as
a software processing module. Voice processing module 225 includes speech
detection module 230, feature extraction module 240, automatic speech
recognition (ASR) engine 250, and prompt generation module 260. Speech
detection module 230 may be used to detect voice initiation in the data signal
received from voice input device 210. Feature extraction module 240 may be
used to extract features used by ASR engine 250 and remove the prompt echo from
input signal 204. A feature is a representation of a speech signal that is suitable for
automatic speech recognition. For example, a feature may be Mel-Frequency
Cepstrum Coefficients (MFCC) and their first and second order derivatives, as
discussed below in relation to Figure 6. As such, feature extraction may be used
to obtain a speech feature from the original speech signal waveform.

ASR engine 250 provides the function of speech recognition. Input 231 to
ASR engine 250 contains vectors of speech. ASR engine 250 outputs 241 a
recognition result as a word string. When ASR engine 250 recognizes a segment
of speech, according to a particular prompt that is playing, feature extraction
module 240 cleans up the speech containing data signal. For example, feature
extraction module 240 may subtract the corresponding prompt echo's spectrum
from the echo-dirtied speech spectrum. In one embodiment, ASR engine 250 may
be, for example, an Intel Speech Development Toolkit (ISDT) engine available
from Intel Corporation of Santa Clara, CA. In an alternative embodiment, another
ASR engine may be used, for example, ViaVoice available from IBM of Armonk,
N.Y. ASR engines are known in the art; accordingly, a detailed discussion is not
provided.

Prompt generation module 260 generates prompts using a text-to-speech
(TTS) engine that converts text input into speech output. For example, the input
251 to prompt generation module 260 may be a sentence text and the output 261
is a speech waveform of the sentence text. TTS engines are available from
industry manufacturers such as Lucent of Murray Hill, NJ and Lernout &
Hauspie of Belgium. In an alternative embodiment, a custom TTS engine may
be used. TTS engines are known in the art; accordingly, a detailed discussion is
not provided.

After a prompt waveform is generated, prompt generation module 260
plays the prompt through voice output device 220 to the user of telephony system
290. It should be noted that in an alternative embodiment, the operation of voice
processing module 225 may be implemented in hardware, for example, in a
digital signal processor.

Referring again to speech detection module 230, in one embodiment, two
criteria may be used to determine if input signal 204 contains speech. One
criterion may be based on frame energy. A frame is a segment of input signal 204.
Frame energy is the signal energy within the segment. In one embodiment, if a
segment of the detected input signal 204 contains speech, then it may be
assumed that a certain number of frames of a running window of frames will
have their energy levels above a predetermined minimum energy threshold.
The window of frames may be either sequential or non-sequential. The energy
threshold may be set to account for energy from non-desired speech, such as
energy from prompt echo.

In one embodiment, for example, a frame may be set to be 20 milliseconds
(ms), where speech is assumed to be short-time stationary up to 20 ms; the
number of frames may be set to be 8 frames; and the running window may be set
to be 10 frames. If, in this running window, the energy of 8 frames is over the
predetermined minimum energy threshold, then the current time may be
considered as the start point of the speech. The energy threshold may be based
on, for example, an average energy of the prompt echo that is the echo of the prompt
currently being played. In this manner, the frame energy threshold may be set
dynamically. According to different echoes of prompts, the frame energy
threshold may be set as the average energy of the echo. The average energy of
prompt echo may be pre-computed and stored when a prompt is added into
system 200.
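
By way of illustration only, the running-window criterion described above may be sketched in Python as follows. The 8 kHz sample rate, the function names, the sample values and the choice to report the first above-threshold frame of the qualifying window as the start point are assumptions made for this sketch; the description does not fix them.

```python
def frame_energies(samples, frame_len):
    """Split samples into consecutive full frames and return each frame's energy."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_speech_start(energies, threshold, window=10, min_frames=8):
    """Return the possible start frame of speech, or None.

    A start is declared when at least `min_frames` frames of a running
    window of `window` frames have energy above `threshold`.
    """
    above = [e > threshold for e in energies]
    for start in range(len(above) - window + 1):
        in_window = above[start:start + window]
        if sum(in_window) >= min_frames:
            # Assumption: report the first frame in the window that is above threshold.
            return start + in_window.index(True)
    return None


# Assumed 8 kHz telephone audio, so a 20 ms frame holds 160 samples.
SAMPLE_RATE = 8000
FRAME_LEN = SAMPLE_RATE * 20 // 1000

# Threshold taken from a stored average prompt echo energy (illustrative value).
echo_avg_energy = 4.0
quiet = [0.1] * (FRAME_LEN * 12)   # 12 low-energy frames
loud = [1.0] * (FRAME_LEN * 10)    # 10 high-energy frames
energies = frame_energies(quiet + loud, FRAME_LEN)
start_frame = find_speech_start(energies, echo_avg_energy)
```

In this synthetic input the first window holding 8 frames above the threshold begins at frame 10, and the first frame above the threshold within it is frame 12.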

Another criterion that may be used to determine if input signal 204
contains speech is the duration of input signal 204. If the duration of input
signal 204 is greater than a predetermined value then it may be assumed that
input signal 204 contains speech. For example, in one embodiment, it is
assumed that any speech event lasts at least 300 ms. As such, the duration value
may be set to be 300 ms.

After a possible start point of speech is detected, speech detection module
230 attempts to detect the end point of the speech using the same method as
detecting the start point. The start point and the end point of speech are used to
calculate the duration. Continuing the example, if the speech duration is over
300 ms then the possible start point of speech is a real speech start point and the
current speech frames and successive speech frames may be sent to feature
extraction module 240. Otherwise, the possible start point of speech is not a real
start point of speech and speech detection is reset. This procedure lasts until an
end point of speech is detected or input signal 204 is over a maximum possible
length.
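
The start point, end point and duration checks described above may be combined as in the following Python sketch, which is offered as one possible reading rather than a definitive implementation. The function names, the `max_frames` limit and the treatment of an input that ends before the end point is found are assumptions; the end point is found by applying the same running-window test to frames below the threshold.

```python
FRAME_MS = 20  # frame length from the example above


def find_run(flags, begin, window=10, min_frames=8):
    """First flagged frame of the first window at or after `begin`
    that holds at least `min_frames` flagged frames, or None."""
    for s in range(begin, len(flags) - window + 1):
        in_window = flags[s:s + window]
        if sum(in_window) >= min_frames:
            return s + in_window.index(True)
    return None


def detect_speech(energies, threshold, min_duration_ms=300, max_frames=1500):
    """Return (start_frame, end_frame) of the first speech event that lasts
    longer than `min_duration_ms`, or None if no such event is found."""
    energies = energies[:max_frames]          # assumed maximum input length
    loud = [e > threshold for e in energies]
    quiet = [not flag for flag in loud]
    pos = 0
    while True:
        start = find_run(loud, pos)           # possible start point
        if start is None:
            return None
        end = find_run(quiet, start)          # end point, same method on quiet frames
        if end is None:
            end = len(energies)               # input ended before speech did
        if (end - start) * FRAME_MS > min_duration_ms:
            return start, end                 # a real speech start point
        pos = end                             # too short: reset and keep listening


# A 200 ms burst (rejected) followed by a 400 ms utterance (accepted).
energies = [1] * 12 + [10] * 10 + [1] * 15 + [10] * 20 + [1] * 15
segment = detect_speech(energies, threshold=5)
```

Here the first burst spans frames 12 to 22 (200 ms) and is discarded, while the second spans frames 37 to 57 (400 ms) and is reported.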

Speech detection module 230 may also be used to estimate the time delay
of the prompt echo in input signal 204 if an echo cancellation function of system
200 is desired. When a prompt is added in system 200, its waveform may be
generated by prompt generation module 260. The waveform of the prompt is
played once so that its echo is recorded and stored. When processing an input
signal, correlation coefficients between input signal 204 and the stored prompt
echo are calculated with the following equation:

\[ C(\tau) = \sum_{t=0}^{T-1} S(t + \tau) \, E(t) \]

where C is the correlation coefficients; S is input signal 204, E is the prompt echo,
T is the echo length, and τ is the time delay estimation of echo. The value of
τ may range from zero to the maximum delay time (e.g., 200 ms). After C is
computed, the maximum value of C in all τ is found. This value of τ is the time-
delay estimation of echo. This value is used in the feature extraction module 240
when performing spectrum subtraction of the prompt echo spectrum to remove
prompt echo from the input signal 204 having echo dirtied speech, as discussed
below.
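
As a minimal sketch of the delay search described above, the following Python code evaluates an unnormalized correlation between the input signal and the stored prompt echo for every candidate delay and keeps the delay with the largest value. The function name, the sample values and the use of an unnormalized sum are assumptions of this sketch; delays are expressed in samples, so a 200 ms maximum at an assumed 8 kHz sample rate would correspond to `max_delay=1600`.

```python
def estimate_echo_delay(signal, echo, max_delay):
    """Return the delay tau in [0, max_delay] (in samples) that maximizes
    the correlation sum over t of signal[t + tau] * echo[t]."""
    best_tau, best_c = 0, float("-inf")
    for tau in range(max_delay + 1):
        c = sum(
            signal[t + tau] * echo[t]
            for t in range(len(echo))
            if t + tau < len(signal)      # ignore samples past the end of the input
        )
        if c > best_c:
            best_tau, best_c = tau, c
    return best_tau


# Stored prompt echo and an input that contains it, attenuated, 7 samples later.
echo = [0.0, 1.0, -2.0, 3.0, -1.0, 0.5]
signal = [0.0] * 7 + [0.6 * e for e in echo] + [0.0] * 10
delay = estimate_echo_delay(signal, echo, max_delay=15)
```

For this synthetic input the search returns a delay of 7 samples, the offset at which the echo was inserted.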

Figure 3 illustrates one embodiment of a method of implementing an
interactive voice response system. A prompt echo waveform 302 and an input
signal 301 are received by the system. In one embodiment, a speech detection
module may estimate the time delay of a feature in input signal 301, step
310. The speech detection module may also be used to detect the existence of
speech in input signal 301, in step 320. The existence of speech may be based on
various criteria, such as amount of frame energy of the input signal and the
duration of frame energy, as discussed below in relation to Figure 4.

In step 330, feature extraction may be used to obtain a speech feature from
the original speech signal waveform. In one embodiment, prompt echo may be
removed from input signal 301, using spectrum subtraction, to facilitate the
recognition of speech in the input signal. After feature extraction is performed,
speech recognition may be performed on input signal 301, step 340. A prompt
may then be generated, step 350, based on the recognized speech.

Figure 4 illustrates one embodiment of speech detection in an input
signal. In one embodiment, the frame energy of an input signal may be used to
determine if the input signal contains speech. An assumption may be made that
if the energy of the input signal, over a certain period of time, is above a certain
threshold level then the signal may contain speech.

Thus, in one embodiment, an energy threshold for the input signal may
be set, step 410. The energy threshold is set higher than the prompt echo energy
so that the system will not consider the energy of prompt echo in the input
signal to be speech. In one embodiment, the energy threshold may be based on
an average energy of the prompt echo that is the echo of the prompt currently
playing during the speech detection. The energy of the input signal is measured
over a predetermined time period, step 420, and compared against the energy
threshold.

The input signal may be measured over time segments, or frames. In one
embodiment, for example, a frame length of an input signal may be 20
milliseconds in duration where speech is assumed to be a short-time stationary
event up to 20 milliseconds. In step 430, the number of frames
containing energy above the threshold is counted. If the energy of the input
signal over a predetermined number of frames (e.g., 8 frames) is greater than the
predetermined energy threshold, then the input signal may be considered to
contain speech with that point of time as the start of speech, step 440.

In one embodiment, the energy of the input signal may be monitored over
a running window of time. If in this running window (e.g., 10 frames) there is
the predetermined number of frames (e.g., 8 frames) over the predetermined
energy threshold, then that point of time may be considered as the start of
speech.

In an alternative embodiment, another method of detecting the start of
speech may be used. For example, the rate of input signal energy crossing over
the predetermined threshold may be calculated. If the measured rate exceeds a
predetermined rate, such as a zero-cross threshold rate, then the existence and
start time of speech in the input signal may be determined.

If no speech is detected in the input signal, then a determination may be
made whether the period of silence (i.e., non-speech) is too long, step 445. If a
predetermined silence period is not exceeded, then the system continues to
monitor the input signal for speech. If the predetermined silence period is
exceeded, then the system may end its listening and take other actions, for
example, error processing (e.g., close the current call), step 447.

In one embodiment, the duration of frame energy of an input signal may
also be used to determine if the input signal contains speech. A possible start
point of speech is detected as described above in relation to steps 410 through
440. After a possible start point of speech is detected, then the end point of the
speech is detected to determine the duration of speech, step 450. In one
embodiment, the end point of speech may be determined in a manner similar to
that of detecting the possible start point of speech. For example, the energy of
the input signal may be measured over another predetermined time period and
compared against the energy threshold. If the energy over the predetermined
time period is less than the energy threshold then the speech in the input signal
may be considered to have ended. In one embodiment, the predetermined time
in the speech end point determination may be the same as the predetermined
time in the speech start point determination. In an alternative embodiment, the
predetermined time in the speech end point determination may be different from
the predetermined time in the speech start point determination.

Once the end point of speech is determined, the duration of the speech is
calculated, step 460. If the duration is above a predetermined duration
threshold, then the possible start point of speech is a real speech start point, step
470. In one embodiment, for example, the predetermined duration threshold
may be set to 300 ms where it is assumed that any anticipated speech event lasts
for at least 300 ms.
Otherwise, the possible start point of speech is not a real start point of
speech and the speech detection may be reset. This procedure lasts until an end
point of speech is detected or the input signal is over a maximum possible
length, step 480.

Figure 5 illustrates one embodiment of a feature extraction method. In
one embodiment, an input signal and a prompt echo waveform are received,
steps 515 and 525, respectively. A Fourier transformation is performed to obtain
a speech spectrum from the input signal, step 510. A Fourier transformation
may also be performed on the echo waveform to generate a prompt echo
spectrum, step 526.

In one embodiment, the prompt echo spectrum is shifted according to a
time delay estimated between the input signal and the prompt echo waveform,
step 519. The prompt echo spectrum is computed and subtracted from the
speech spectrum, step 520. Afterwards, the Cepstrum coefficients may be
obtained for use by ASR engine 250 of Figure 2 in performing speech
recognition, step 530.

In one embodiment, feature extraction involves the cancellation of prompt
echo from the input signal, as discussed below in relation to Figure 6. When
ASR engine 250 of Figure 2 recognizes a segment of speech, feature extraction
may be used to subtract a prompt echo spectrum that corresponds to the
currently playing prompt from the echo-dirtied speech spectrum. In order to
improve spectrum subtraction, an estimation of the time delay between the
echo-dirtied speech and the recorded echo may be performed by speech detection
module 230 of Figure 2.
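
The spectrum subtraction described above may be sketched in Python as follows. This is a simplified illustration under stated assumptions: the echo waveform is shifted in the time domain before its spectrum is computed, magnitudes are subtracted bin by bin, negative results are clamped to zero, and a naive discrete Fourier transform is used in place of a fast one to keep the sketch dependency-free. The function names and the synthetic tones are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2 (naive O(n^2) transform, kept simple for clarity)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def subtract_echo(frame, frame_start, echo, delay):
    """Subtract the spectrum of the delay-shifted echo from the spectrum of one input frame."""
    shifted = []
    for i in range(len(frame)):
        j = frame_start + i - delay   # echo[j] arrives in the input at time j + delay
        shifted.append(echo[j] if 0 <= j < len(echo) else 0.0)
    speech_spec = magnitude_spectrum(frame)
    echo_spec = magnitude_spectrum(shifted)
    # Clamp at zero so no bin goes negative (a common choice; not stated in the text).
    return [max(s - e, 0.0) for s, e in zip(speech_spec, echo_spec)]


# Synthetic check: echo tone in DFT bin 2, "speech" tone in bin 5, echo delayed by 3 samples.
N, DELAY, START = 32, 3, 8
echo = [math.sin(2 * math.pi * 2 * t / N) for t in range(64)]
speech = [0.5 * math.cos(2 * math.pi * 5 * t / N) for t in range(64)]
received = [speech[t] + (echo[t - DELAY] if t >= DELAY else 0.0) for t in range(64)]
cleaned = subtract_echo(received[START:START + N], START, echo, DELAY)
```

In this example the echo tone in bin 2 is removed almost entirely, while the speech tone in bin 5 keeps its magnitude of about 8.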

Figure 6 illustrates an embodiment of a feature extraction method for a
particular feature. In one embodiment, Mel-Frequency Cepstrum Coefficients
(MFCC) may be used in performing speech recognition. Using an MFCC
generation procedure, a Hamming window is applied to the frame segment set
for speech (e.g., 20 ms), step 610. A Fast Fourier Transform (FFT) is calculated to
obtain the speech spectrum, step 620. If the echo spectrum subtraction function
is enabled, shift the echo waveform according to the time delay, then compute
the echo spectrum and subtract the echo spectrum from the input signal
spectrum, step 630. Next, perform a logarithmic operation on the speech
spectrum, step 640. Perform Mel-scale warping to reflect the non-linear
perceptual characteristics of human hearing, step 650. Perform an Inverse Discrete
Cosine Transformation (IDCT) to obtain the Cepstrum coefficients, step 660. The
resulting feature is a multi-dimensional (e.g., 12 dimension) vector. These parameters
form the base feature of MFCC.
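
The logarithm, Mel-scale warping and cosine transform steps described above may be sketched in Python as follows, starting from a magnitude spectrum that has already been computed. Several details are assumptions of this sketch and are not taken from the description: the warping is done by linear interpolation of the log spectrum at points equally spaced on the Mel scale (a triangular filter bank applied before the logarithm is the more common practice), the Hz-to-Mel formula is the widely used 2595 log10(1 + f/700) form, the cosine transform is a DCT-II with coefficient 0 omitted, and the band count, sample rate and log floor are illustrative.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def cepstrum_from_spectrum(spectrum, sample_rate=8000, n_bands=24, n_coeffs=12):
    """Log -> Mel-scale warping -> cosine transform, in the step order given in the text."""
    floor = 1e-10  # avoids log(0); the floor value is an assumption
    log_spec = [math.log(max(s, floor)) for s in spectrum]
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    warped = []
    for b in range(n_bands):
        # Band centres equally spaced on the Mel scale, read off by linear interpolation.
        hz = mel_to_hz(top_mel * (b + 0.5) / n_bands)
        pos = hz / nyquist * (len(spectrum) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(spectrum) - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * log_spec[lo] + frac * log_spec[hi])
    # DCT-II of the warped log spectrum; coefficient 0 (overall level) is skipped.
    return [
        sum(w * math.cos(math.pi * k * (b + 0.5) / n_bands) for b, w in enumerate(warped))
        for k in range(1, n_coeffs + 1)
    ]


# A flat spectrum has no spectral shape, so all 12 coefficients come out near zero.
coeffs = cepstrum_from_spectrum([2.0] * 81)
```

The default of 12 coefficients matches the 12 dimension base feature mentioned above; the other defaults are placeholders to be tuned.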

In one embodiment, the first and second derivatives of the base feature
are added to be the additional dimensions (the 13th to 24th and 25th to 36th
dimensions, respectively), to account for a change of speech over time. By using
near-end prompt echo cancellation, the performance of the ASR engine 250 of
Figure 2 may be improved. In one embodiment, for example, the performance
of the ASR engine 250 of Figure 2 may improve by greater than 6%.
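
The extension of the 12 dimension base feature with its first and second derivatives may be sketched in Python as follows. The use of a central difference with repeated edge frames is an assumption of this sketch; the description does not state how the derivatives are computed, and the function name is hypothetical.

```python
def add_derivatives(frames):
    """Append first and second time differences to each base feature vector,
    turning 12 dimension vectors into 36 dimension vectors."""

    def diff(seq):
        # Central difference with the edge frames repeated; one simple choice among several.
        padded = [seq[0]] + seq + [seq[-1]]
        return [
            [(nxt - prv) / 2.0 for prv, nxt in zip(padded[i - 1], padded[i + 1])]
            for i in range(1, len(padded) - 1)
        ]

    first = diff(frames)
    second = diff(first)
    return [base + d1 + d2 for base, d1, d2 in zip(frames, first, second)]


# Five frames whose 12 base coefficients rise by 1 per frame.
base_frames = [[float(t)] * 12 for t in range(5)]
features = add_derivatives(base_frames)
```

For the middle frame of this ramp, dimensions 1 to 12 hold the base value, dimensions 13 to 24 hold a first derivative of 1, and dimensions 25 to 36 hold a second derivative of 0.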

In the foregoing specification, the invention has been described with
reference to specific exemplary embodiments thereof. It will, however, be
evident that various modifications and changes may be made thereto without
departing from the broader spirit and scope of the invention as set forth in the
following claims. The specification and drawings are, accordingly, to be
regarded in an illustrative sense rather than a restrictive sense.

CLAIMS

What is claimed is: 

1. A method, comprising:

detecting an existence of speech in an input signal; and 
removing a prompt echo from the input signal using spectrum 
subtraction. 

2. The method of claim 1, wherein removing the prompt echo
comprises extracting a feature from the input signal to generate a plurality of
coefficients and wherein the method further comprises:

performing speech recognition on the input signal using the 
plurality of coefficients; and 

generating a prompt in response to particular speech recognized in
the input signal.

3. The method of claim 1, wherein the existence of speech is detected
based on a predetermined energy in a plurality of segments of the input
signal.

4. The method of claim 3, wherein the existence of speech is detected
based on a predetermined duration of the plurality of segments having 
the predetermined energy. 

5. The method of claim 1, wherein detecting the existence of speech in
the input signal comprises: 

setting an energy threshold for the input signal, the input signal
having a plurality of segments; and 

determining a start point of speech in the input signal, comprising:

measuring an energy of the input signal for a first
predetermined time; and

determining whether the energy of the input signal for the
first predetermined time is greater than the energy threshold. 

6. The method of claim 5, wherein detecting the existence of speech in
the input signal further comprises:

measuring a duration of the energy above the energy threshold;

and 

determining whether the duration is greater than a predetermined 
duration threshold. 

7. The method of claim 1, wherein removing the prompt echo from
the input signal comprises:

estimating a time delay between the input signal and the echo;

obtaining a speech spectrum from the input signal, the speech
spectrum including the echo;

shifting the echo according to the time delay;

computing an echo spectrum using the shifted echo; and

subtracting the echo spectrum from the speech spectrum. 

8. The method of claim 1, wherein removing the prompt echo from
the input signal comprises:

estimating a time delay between the input signal and the prompt
echo; and

removing the prompt echo from the input signal based on the time
delay.

9. The method of claim 8, wherein removing prompt echo comprises:
generating a prompt echo spectrum;

calculating a Fast Fourier Transform using the input signal to
obtain a speech spectrum;

subtracting the prompt echo spectrum from the speech spectrum
using the estimated time delay.

10. The method of claim?, further comprising: 
performing an inverse DCT to obtain Cepstrum coefficients;

performing a logarithm on the speech spectrum; and

performing Mel-scale warping on the logarithm of the speech
spectrum.

11. A method, comprising:

setting an energy threshold for a signal having a plurality of 
segments; and 

determining a start point of speech in the signal, comprising:

measuring an energy of the signal for a first predetermined
time; and

determining whether the energy of the signal for the first
predetermined time is greater than the energy threshold. 

12. The method of claim 11, wherein setting an energy threshold 
comprises: 

measuring a prompt echo energy; and 

setting the energy threshold above the prompt echo energy. 

13. The method of claim 11, further comprising determining an end
point of the speech. 

14. The method of claim 13, wherein determining the end point of the
speech comprises:

measuring the energy of the signal for a second predetermined
time; and

determining whether the energy of the signal for the second
predetermined time is less than the energy threshold.

15. The method of claim 14, further comprising:

calculating a duration based on the start point and the end point; 

and 

determining whether the duration is greater than a predetermined 
duration threshold. 

16. The method of claim 14, wherein the first and second
predetermined times are the same.

17. The method of claim 13, wherein determining the end point of the
speech comprises:

measuring the energy of the signal for a second predetermined
time;

calculating a rate of energy crossing over the energy threshold; and
determining whether the rate is greater than a predetermined rate.

18. The method of claim 17, further comprising:

calculating a duration based on the start point and the end point;
and

determining whether the duration is greater than a predetermined
duration threshold.

19. The method of claim 11, wherein the determining whether the
energy of the signal for the first predetermined time is greater than the
energy threshold is performed over a running window of time.

20. The method of claim 16, wherein the first predetermined number
of segments is eight and the running window of segments is ten.

21. The method of claim 20, wherein a segment of the plurality of
segments is 20 milliseconds.

22. A machine readable medium having stored thereon instructions,
which when executed by a processor, cause the processor to perform the
following, comprising:

detecting the existence of speech in an input signal; and
removing a prompt echo from the input signal using spectrum
subtraction.

23. The machine readable medium of claim 22, wherein removing the
prompt echo from the input signal causes the processor to perform the
following, comprising:

estimating a time delay between the input signal and the prompt
echo;

obtaining a speech spectrum from the input signal, the speech
spectrum including the prompt echo;

shifting the prompt echo according to the time delay;
computing a prompt echo spectrum using the shifted echo; and
subtracting the prompt echo spectrum from the speech spectrum.

24. The machine readable medium of claim 22, wherein the existence
of speech is detected based on a predetermined energy in a plurality of
segments of the input signal and a predetermined duration of the 
plurality of segments having the predetermined energy. 

25. A machine readable medium having stored thereon instructions,
which when executed by a processor, cause the processor to perform the
following, comprising:

setting an energy threshold for a signal having a plurality of 
segments; and 

determining a start point of speech in the signal, comprising:

measuring an energy of the signal for a first predetermined
time; and 

determining whether the energy of the signal for the first 
predetermined time is greater than the energy threshold. 

26. The machine readable medium of claim 25, wherein the processor
further performs the following, comprising:

determining an end point of speech in the signal, comprising:
measuring the energy of the signal for a second
predetermined time; and
determining whether the energy of the signal for the second
predetermined time is less than the energy threshold.

27. The machine readable medium of claim 25, wherein the processor
further performs the following, comprising:

calculating a duration based on the start point and the end point;

and 

determining whether the duration is greater than a predetermined 
duration threshold. 

28. An apparatus, comprising: 

a voice processing module; and 

a voice interface device coupled to the voice processing module, 
the voice interface device comprising
a voice input device; and

a voice output device coupled to the voice input device to
support full-duplex data transfer between the voice interface device and a
telephony system. 

29. The apparatus of claim 28, wherein the voice processing module
comprises a digital signal processor. 

30. The apparatus of claim 28, wherein the voice processing module
comprises processing software and wherein the processing software
comprises:

a speech detection module; 

a feature extraction module;

an automatic speech recognition engine; and

a prompt generation module. 